Descriptive Statistics: Central Tendency and Dispersion
POLS 3312: Argument, Data, and Politics

Tom Hanna

2024-01-29

This Week’s Agenda

  1. Basic statistics that describe the data
  2. Describing one variable at a time
  3. Central tendency: averages, describing the typical result
  4. Dispersion: describing the variation and range of results
  5. Reading Academic Articles

Descriptive Statistics

Why do we use descriptive statistics?

  • Explore the data
  • See patterns in the data
  • Communicate about the data

What are the most basic things we need to know?

  • What is the scope of the data (time, geography, cases)?
  • What are the variables?
  • What is the unit of observation?

Example

“This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarized according to economic status (class), sex, age and survival.”

  Class    Sex   Age Survived Freq
1   1st   Male Child       No    0
2   2nd   Male Child       No    0
3   3rd   Male Child       No   35
4  Crew   Male Child       No    0
5   1st Female Child       No    0
6   2nd Female Child       No    0

This data is formatted the way we typically like to use data:

  • units of observation: rows
  • variables: columns

BEWARE! Not all data is formatted this way! Sometimes you have to think “is this a variable or a unit of observation?”

For example, data is often presented with variables as rows and units of observation as columns. That’s the easy case: transposing the table fixes it.

Sometimes, we get data in a mixed format called wide format. For example, the following data on Scandinavian temperatures:

  country avgtemp.1994 avgtemp.1995 avgtemp.1996
1  Sweden            5            5            9
2 Denmark            9            9            4
3  Norway            8            5            8

It looks like the unit of observation is country and the variable is a combination of year and temperature.

If we look at it in the long format we are used to, it’s a little clearer:

  country year avgtemp
1  Sweden 1994       5
2 Denmark 1994       9
3  Norway 1994       8
4  Sweden 1995       5
5 Denmark 1995       9
6  Norway 1995       5
7  Sweden 1996       9
8 Denmark 1996       4
9  Norway 1996       8

The variable is average temperature.

The unit of observation is actually not country - it’s country-year. Sweden-1994 is one observation, Sweden-1995 is a different observation, and Sweden-1996 is a different observation all with different temperature values.
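The wide-to-long reshaping above can be sketched in a few lines of Python. This is only an illustration: the helper name `wide_to_long` is hypothetical, and the rows are typed in by hand from the temperature table rather than loaded from a file.

```python
# Wide-format rows copied from the Scandinavian temperature table.
wide = [
    {"country": "Sweden", "avgtemp.1994": 5, "avgtemp.1995": 5, "avgtemp.1996": 9},
    {"country": "Denmark", "avgtemp.1994": 9, "avgtemp.1995": 9, "avgtemp.1996": 4},
    {"country": "Norway", "avgtemp.1994": 8, "avgtemp.1995": 5, "avgtemp.1996": 8},
]


def wide_to_long(rows):
    """Return one record per country-year (hypothetical helper)."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key.startswith("avgtemp."):
                # The year is the part of the column name after the dot.
                year = int(key.split(".")[1])
                long_rows.append(
                    {"country": row["country"], "year": year, "avgtemp": value}
                )
    # Stable sort by year keeps the original country order within each year.
    return sorted(long_rows, key=lambda r: r["year"])


long_rows = wide_to_long(wide)
for r in long_rows:
    print(r["country"], r["year"], r["avgtemp"])
```

Each printed line is one country-year observation, which is why three wide rows become nine long rows.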

Measures of Central Tendency

Measures of central tendency help us:

  • reveal patterns
  • find the typical measurement
  • find the center

Measures of Central Tendency

A few numbers that can summarize the center of measurement

  • Mean

  • Median

  • Mode

Mean

  • Symbol: \(\bar{x}\)
  • Not the middle value
  • Not the most common
  • The center of mass - the total distance of the values above the mean equals the total distance of the values below it
  • Formula: \(\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\)
  • Read that: the mean of x equals the sum of the observations \(x_i\) divided by the number of observations (n).

Example A:

A. What is the mean of 1,5,7,9,10,12,18

[1] 8.857143

Example B

B. What is the mean of 10,20,25,30,35,40,45,50,55

[1] 34.44444
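The two means can be reproduced directly from the formula. The sketch below uses Python rather than R, and the function name `mean` is just an illustrative choice.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


example_a = [1, 5, 7, 9, 10, 12, 18]
example_b = [10, 20, 25, 30, 35, 40, 45, 50, 55]

mean_a = mean(example_a)  # 62 / 7
mean_b = mean(example_b)  # 310 / 9

print(round(mean_a, 6))  # 8.857143
print(round(mean_b, 5))  # 34.44444
```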

Median

  • Midpoint of the sorted values
  • Half of the observations are greater, half are lower
  • Just sort and count to the middle
  • Even number of observations - take the midpoint between the middle two values

Example A

A - 1,5,7,9,10,12,18

[1] 9

Example B

B - 10,20,25,30,35,40,45,50,55

[1] 35
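The "sort and count" rule, including the even-count case, might look like this in Python. The third list is an assumption added for illustration (Example A with its last value dropped) so that the even case has something to work on.

```python
def median(values):
    """Middle value of the sorted data; average the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


example_a = [1, 5, 7, 9, 10, 12, 18]
example_b = [10, 20, 25, 30, 35, 40, 45, 50, 55]
even_example = [1, 5, 7, 9, 10, 12]  # illustrative only, not from the slides

print(median(example_a))     # 9
print(median(example_b))     # 35
print(median(even_example))  # 8.0, halfway between 7 and 9
```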

Keep in mind for later

In both of our examples, the mean and median were close but not the same. That isn’t always the case.

Mode

  • Most common value
  • Just count

Examples:

C. 1,2,3,4,4,5,6,7

Answer: 4

D. 10,20,30,30,40,40,40,50,50,60,70

Answer: 40
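Counting by hand works for short lists; for longer ones the same idea can be sketched with Python's `collections.Counter`. The helper name `modes` is hypothetical, and it returns a list because a data set can have more than one most common value.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs the maximum number of times."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


example_c = [1, 2, 3, 4, 4, 5, 6, 7]
example_d = [10, 20, 30, 30, 40, 40, 40, 50, 50, 60, 70]

print(modes(example_c))  # [4]
print(modes(example_d))  # [40]
```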

Advantages and disadvantages

  • Median is barely affected by outliers

  • Mean gives the broader picture because it uses every value, including the outliers.

  • Mode is the only option for categorical variables.

  • We will discuss types of variables more in an upcoming class.
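A quick way to see the outlier point is to change one value in Example A and recompute both measures. The modified list below is an assumption made up for illustration (18 replaced by 180); it uses Python's built-in `statistics` module.

```python
import statistics

example_a = [1, 5, 7, 9, 10, 12, 18]
with_outlier = [1, 5, 7, 9, 10, 12, 180]  # illustrative: 18 replaced by 180

mean_before = statistics.mean(example_a)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(example_a)
median_after = statistics.median(with_outlier)

# The mean jumps because it uses every value; the median does not move.
print(round(mean_before, 2), mean_after)  # 8.86 32
print(median_before, median_after)        # 9 9
```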

Skewed distribution - when mean and median are different

The three numbers are often different for the same sample or population.

Example:

[Figure: negatively skewed, normal, and positively skewed distributions]

Measures of Dispersion (Variation or Spread)

  • Sample data: the USArrests dataset that ships with the R statistical programming software
  • State level data
  • 50 observations

Look at the data (first 12 of 50 rows)

            Assault UrbanPop
Alabama         236       58
Alaska          263       48
Arizona         294       80
Arkansas        190       50
California      276       91
Colorado        204       78
Connecticut     110       77
Delaware        238       72
Florida         335       80
Georgia         211       60
Hawaii           46       83
Idaho           120       54

Find the Center (Mean)

  • Finding the mean is the first step
  • We need to know the center to find the spread around the center
  • mean is part of the formula for variance

Mean Assault Arrests per 100,000 population:

[1] 170.76

Mean Urban Population Percentage:

[1] 65.54

Find the Center (Median)

  • Why? We want to know if the data is skewed

Median Assault Arrests:

[1] 159

Median Urban Population:

[1] 66

Compare the Mean and Median

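  • If the mean and median are close, the data is probably not strongly skewed
  • If the mean is well above the median, that suggests positive (right) skew; well below suggests negative (left) skew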

Mean Assault Arrests:

[1] 170.76

Median Assault Arrests:

[1] 159

Mean Urban Population:

[1] 65.54

Median Urban Population:

[1] 66

Scattered around the mean

  • Measures of dispersion typically look at how the data is scattered around the mean.

  • Let’s look at that visually.

  • First the mean of Assault

  • Then the mean of Urban Population

Scattered around the mean: Assault Arrests

Scattered around the mean: Urban Population

Creating a measure of dispersion: distance to mean

  • So, we could define a measure of dispersion or variation that is the total length of the colored lines.
  • Our formula in English would be “the sum of the differences between each observation and the mean”

Problem with sum of distances

The problem is that, because of how the mean is defined, the positive differences exactly cancel the negative ones, so this sum is always zero no matter how spread out the data is!
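The cancellation is easy to check numerically. The sketch below uses only the 12 Assault values displayed in the table earlier (not the full 50 states), so its mean differs from the 170.76 reported for the whole dataset.

```python
# Assault arrests for the 12 states shown in the table (Alabama to Idaho).
assault_subset = [236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46, 120]

subset_mean = sum(assault_subset) / len(assault_subset)

# Signed distance of each observation from the mean.
deviations = [x - subset_mean for x in assault_subset]

total_signed = sum(deviations)
total_above = sum(d for d in deviations if d > 0)
total_below = sum(-d for d in deviations if d < 0)

print(subset_mean)               # 210.25
print(total_signed)              # zero, up to floating point rounding
print(total_above, total_below)  # the two totals match
```

The positive and negative totals are equal, which is the "center of mass" property of the mean.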

Simple Data Example

Suppose we had a very simple data set with only two observations - 5 and 15. The mean is 10. One is 5 above the mean and one is 5 below the mean.

Example

Point 1
[1] 5
Point 2
[1] 15
The mean
[1] 10

Distance from mean

Distance 1
[1] -5
Distance 2
[1] 5

Distance from Mean Total

So, we want our new measure total_variation to equal the sum of the distances, which would be 10. But when we add 5 plus -5 we get:

The variation is:
[1] 0

Math comes to the rescue!

  • What is something we can do that turns a negative number into a positive number every time and leaves a positive number as a positive?

  • It’s also important that any effect it has on the actual size of the numbers is consistent between positive and negative numbers.

  • We can square the distances

[1] 25
[1] 25

Results

  • Squaring 5 turned it into 25
  • Squaring -5, which is the same size but negative, also turned it into 25.
  • So, now we can add them to get a measure of total_squared_variation.

Total squared variation is:

[1] 50

Are we done?

  • Suppose we had 1000 observations
  • Mean still 10
  • Each still 5 points away on average
  • What would our total squared variation be?

Given that the average distance is exactly the same for both data sets, does that make sense? Is it useful?
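To make the question concrete, here is a small Python sketch. The 1000-observation data set is an assumption built for illustration: 500 copies of 5 and 500 copies of 15, so the mean is still 10 and every point is still 5 away.

```python
def total_squared_deviation(values):
    """Sum of squared distances from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


small = [5, 15]        # the two-point example
large = [5, 15] * 500  # illustrative: 1000 observations, same spread

total_small = total_squared_deviation(small)
total_large = total_squared_deviation(large)

print(total_small)  # 50.0
print(total_large)  # 25000.0

# Dividing by the number of observations removes the dependence on n.
print(total_small / len(small), total_large / len(large))  # 25.0 25.0
```

The total grows with the number of observations even though the spread is unchanged, while the average squared distance stays at 25.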

Solution: Average Squared Difference - Variance

  • We want the average of the squared differences

  • So our measure of variance, in its simplest form, is \(\frac{\sum (x_i - \bar{x})^2}{n}\):

The variance:
[1] 25

This is actually the population variance for this simple data example.
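The distinction between population and sample variance matters when checking this in software. As a sketch in Python's `statistics` module: `pvariance` divides by n, while `variance` divides by n - 1 (the same convention as R's `var()`), so the two give different answers for the two-point example.

```python
import statistics

data = [5, 15]

population_variance = statistics.pvariance(data)  # divides by n
sample_variance = statistics.variance(data)       # divides by n - 1

print(population_variance)  # 25
print(sample_variance)      # 50
```

With only two observations the gap is large; it shrinks as n grows.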

Squares inflate the results

  • Squares inflate the numbers relative to the size of the mean.
  • 25 is 2.5 times the mean.
  • But the distances aren’t really that big
  • Average distance is still 5

Solution: Square root of the variance

  • To partially account for this we can take the square root of the variance
  • This gets us back to our original units of measurement
  • That gives us our next measure of dispersion and arguably the most important: standard deviation

Standard deviation

  • Standard deviation is the square root of the variance

Standard deviation is:

[1] 5
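The last step can be checked the same way. This sketch assumes the population form (dividing by n), matching the variance of 25 above, and uses `statistics.pstdev` from Python's standard library.

```python
import math
import statistics

data = [5, 15]

# Square root of the population variance, computed two ways.
sd_from_variance = math.sqrt(statistics.pvariance(data))
sd_direct = statistics.pstdev(data)

print(sd_from_variance)  # 5.0
print(sd_direct)         # 5.0
```

The result is back in the original units and matches the average distance of 5 from the mean.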

Authorship, License, Credits

Creative Commons License